OpenVemo: A Simple Tool for Rapid Story Content Creation
Development Background
The project grew out of the many repetitive workflow problems our department's team recently ran into while producing simple video blogs and short-form videos. By studying how audio and video content tends to be produced and integrating the capabilities of existing large AI models, we aim to automate more of the current workflow, and we also hope to bring open-source technology to more scenarios.
We identified a few typical use cases among the people around us:

User inputs a theme → Generate a story framework → Convert it into a video-ready dialogue script → Generate voiceover and images → Export a complete content draft (for further manual refinement)

The large models chosen for this project are mainly the following:
- DeepSeek-V3: structured generation of story frameworks and dialogue text
- Kokoro TTS: multi-character voice synthesis
- FLUX-1 dev: image generation for the illustrations the content needs
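Throughout the snippets below, three handles appear without their setup: the DeepSeek client (`openai` / `openai_client`), the Kokoro client (`tts_client`), and `SILICONFLOW_ENDPOINT`. A plausible initialization follows; the base URLs, port, and environment-variable name are assumptions, not values from this post.

```python
import os
from openai import OpenAI

# Assumed wiring for the three services used below; the URLs, port,
# and env-var name are placeholders rather than values from the post.
openai_client = OpenAI(
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible API
    api_key=os.environ["DEEPSEEK_API_KEY"],
)
tts_client = OpenAI(
    base_url="http://localhost:8880/v1",  # local kokoro-fastapi container
    api_key="not-needed",                 # the local server ignores the key
)
SILICONFLOW_ENDPOINT = "https://api.siliconflow.cn/v1/images/generations"
```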
Key Features

Theme-based Story Framework Generation

The tool's basic feature is to take a given theme and generate the story framework for a short video, which doubles as a simple skeleton script.
@app.post("/generate-story")
async def create_story(theme: str):
"""通过openai接口调用deepseek api 生成300字以内带一定基础叙述结构的故事"""
response = openai.chat.completions.create(
model="deepseek-chat",
messages=[{
"role": "system",
"content": "你是一个专业作家,擅长构建起承转合的故事结构..."
},{
"role": "user",
"content": f"创作关于'{theme}'的短篇故事,包含人物冲突和结局"
}],
temperature=0.85
)
return format_story(response.choices[0].message.content)生成的故事框架内容如下,目前设定能稳定输出在200-300字之间的故事框架。
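`format_story` is referenced above but never shown in the post; here is a minimal sketch, assuming its job is simply to trim the raw completion and split it into sections so each one can be voiced and illustrated separately:

```python
# Hypothetical helper -- the real format_story is not shown in the post.
# Splits the raw completion into paragraph sections for later
# per-section voiceover and illustration.
def format_story(raw: str) -> dict:
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    return {
        "story": raw.strip(),
        "sections": [{"id": str(i), "text": p} for i, p in enumerate(paragraphs)],
    }
```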

Natural-sounding Voice Synthesis

After comparing several domestic and international TTS models, we settled on Kokoro TTS: its character voices are convincingly lifelike, and its storage footprint and resource consumption stay within an acceptable range. We run the kokoro-fastapi-cpu build, so the container can use the VPS's CPU to generate the voiceovers. Generated audio is kept on the local disk for a while, giving users a window (as long as they stay on the page) to download the voiceover for further editing.
```python
def tts_pipeline(text: str, voice: str) -> bytes:
    """11 hard-coded voice presets for now; more may be added later"""
    response = tts_client.audio.speech.create(
        model="kokoro",
        voice=voice,
        input=text,
        speed=1.1,
        extra_body={"pitch": 0.8},  # non-standard parameter forwarded to the Kokoro server
    )
    return response.content
```

Kokoro's main weakness is that its Chinese support is still limited; most English voices work well, and better coverage of Chinese and other languages will depend on community updates.
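For reference, a minimal caller that synthesizes one section and keeps the clip on local disk (the input text and file name are illustrative):

```python
# Synthesize one section and store the clip locally for later download
audio_bytes = tts_pipeline("Once upon a time, in a quiet harbor town...", voice="af_nicole")
with open("narration.wav", "wb") as f:
    f.write(audio_bytes)
```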
Image Generation Model Choice

This step generates the image each part of the story framework needs. Output from many earlier image models fit the prompt and theme poorly; thanks to the open release of the FLUX image model, we generate a separate image for each section of the framework above using tuned prompts. At the end, the page also offers the choice to bundle everything together or download individual images for further editing.
@app.post("/generate-scene")
async def render_image(prompt: str):
payload = {
"model": "FLUX.1-dev",
"prompt": f"best quality, 4k, {prompt}",
"negative_prompt": "blurry, lowres",
"seed": int(time.time() % 1000)
}
response = requests.post(SILICONFLOW_ENDPOINT, json=payload)
return response.json()["images"][0]["url"]
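One caveat worth noting: `requests.post` is a blocking call inside an async handler, so a slow image generation stalls FastAPI's event loop. A non-blocking variant of the same endpoint using httpx (a sketch, not the code the project currently runs) could look like:

```python
import time
import httpx

@app.post("/generate-scene-async")
async def render_image_async(prompt: str):
    payload = {
        "model": "FLUX.1-dev",
        "prompt": f"best quality, 4k, {prompt}",
        "negative_prompt": "blurry, lowres",
        "seed": int(time.time() % 1000),
    }
    # AsyncClient keeps the event loop free while SiliconFlow renders
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.post(SILICONFLOW_ENDPOINT, json=payload)
    return response.json()["images"][0]["url"]
```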
Data Models and the Voice Synthesis Implementation
```python
from typing import List, Optional
from pydantic import BaseModel

class GenerateRequest(BaseModel):
    text: str
    voice: str = "af_nicole"

class Section(BaseModel):
    text: str
    voice: str

class StoryRequest(BaseModel):
    theme: str

class ScriptRequest(BaseModel):
    story: str

class PodcastRequest(BaseModel):
    topic: str

class ImagePromptRequest(BaseModel):
    text: str
    context: Optional[str] = None

class ImageRequest(BaseModel):
    prompt: str
    sectionId: str
    seed: int = 123

class ImageSection(BaseModel):
    id: str
    text: str

class DownloadRequest(BaseModel):
    images: List[dict]
    theme: Optional[str] = None

class TranslationRequest(BaseModel):
    script: str
```
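A quick illustration of how the field defaults behave:

```python
# seed falls back to 123 when the client omits it
req = ImageRequest(prompt="a foggy harbor at dawn", sectionId="3")
print(req.seed)    # 123

# voice falls back to the af_nicole preset
gen = GenerateRequest(text="Hello there")
print(gen.voice)   # af_nicole
```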
```python
# TTS routes
@app.get("/voices")
async def get_voices():
    voices = [
        {"id": "af", "name": "Default", "language": "en-us", "gender": "Female"},
        {"id": "af_bella", "name": "Bella", "language": "en-us", "gender": "Female"},
        {"id": "af_nicole", "name": "Nicole", "language": "en-us", "gender": "Female"},
        {"id": "af_sarah", "name": "Sarah", "language": "en-us", "gender": "Female"},
        {"id": "af_sky", "name": "Sky", "language": "en-us", "gender": "Female"},
        {"id": "am_adam", "name": "Adam", "language": "en-us", "gender": "Male"},
        {"id": "am_michael", "name": "Michael", "language": "en-us", "gender": "Male"},
        {"id": "bf_emma", "name": "Emma", "language": "en-gb", "gender": "Female"},
        {"id": "bf_isabella", "name": "Isabella", "language": "en-gb", "gender": "Female"},
        {"id": "bm_george", "name": "George", "language": "en-gb", "gender": "Male"},
        {"id": "bm_lewis", "name": "Lewis", "language": "en-gb", "gender": "Male"}
    ]
    return voices
```
@app.post("/generate-and-merge")
async def generate_and_merge(sections: List[Section]):
async def generate_stream():
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
output_dir = BASE_DIR / "output" / timestamp
output_dir.mkdir(parents=True, exist_ok=True)
temp_dir = BASE_DIR / "temp" / timestamp
temp_dir.mkdir(parents=True, exist_ok=True)
audio_files = []
try:
for i, section in enumerate(sections):
yield json.dumps({
"type": "progress",
"current": i+1,
"total": len(sections),
"message": f"Generating audio for section {i+1}/{len(sections)}"
}) + "\n"
response = tts_client.audio.speech.create(
model="kokoro",
voice=section.voice,
input=section.text
)
temp_file = temp_dir / f"temp_{i}.wav"
response.stream_to_file(temp_file)
audio_files.append(temp_file)
await asyncio.sleep(0.1)
yield json.dumps({
"type": "status",
"message": "Merging audio files..."
}) + "\n"
combined = AudioSegment.empty()
for f in audio_files:
combined += AudioSegment.from_wav(f)
final_path = output_dir / "audio.wav"
combined.export(final_path, format="wav")
yield json.dumps({
"type": "complete",
"success": True,
"filename": f"output/{timestamp}/audio.wav"
}) + "\n"
except Exception as e:
yield json.dumps({
"type": "error",
"error": str(e)
}) + "\n"
finally:
for f in audio_files:
try: f.unlink()
except: raise HTTPException(400, detail="Invalid addio format")
try: temp_dir.rmdir()
except: raise HTTPException(400, detail="Invalid addio format")翻译部分
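Each yielded line is a self-contained JSON event, so a client can track progress without waiting for the merged file. A hypothetical consumer of the stream (endpoint URL and section texts are illustrative):

```python
import json
import requests

# Read the newline-delimited JSON progress events as they arrive
with requests.post(
    "http://localhost:8000/generate-and-merge",
    json=[
        {"text": "Once upon a time...", "voice": "af_nicole"},
        {"text": "The end.", "voice": "am_adam"},
    ],
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            event = json.loads(line)
            print(event["type"], event.get("message", event))
```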
@app.post("/translate-podcast")
async def translate_podcast(request: TranslationRequest):
try:
response = openai_client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role": "system", "content": "你是一名专业的地道的中文翻译家..."},
{"role": "user", "content": f"Translate this script:\n{request.script}"}
]
)
return {
"success": True,
"translation": response.choices[0].message.content
}
except Exception as e:
raise HTTPException(500, detail={"success": False, "error": str(e)})
@app.post("/translate-story-script")
async def translate_story_script(request: TranslationRequest):
try:
response = openai_client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role": "system", "content": "你是一名专业的地道的中文故事翻译家..."},
{"role": "user", "content": f"Translate this script:\n{request.script}"}
]
)
return {
"success": True,
"translation": response.choices[0].message.content
}
except Exception as e:
raise HTTPException(500, detail={"success": False, "error": str(e)})关于流量管控
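The two routes differ only in their system prompt; a shared helper (a refactoring sketch, not code from the project) would remove the duplication:

```python
async def _translate(system_prompt: str, script: str) -> dict:
    """Shared translation call; each route supplies its own system prompt."""
    try:
        response = openai_client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Translate this script:\n{script}"},
            ],
        )
        return {"success": True, "translation": response.choices[0].message.content}
    except Exception as e:
        raise HTTPException(500, detail={"success": False, "error": str(e)})
```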
Traffic Management

The DeepSeek and SiliconFlow services the tool relies on are still in short supply, so the project caps the number of API requests, mainly in the cloud-hosted version: 5 requests per day per IP, with counters zeroed at the end of each day and restarted the next. The design combines an HTTP middleware with a counter cache; because the project runs on a temporarily hosted cloud server, and to keep the experience responsive as request volume grows, the counters are currently cached in memory.
```python
# Core rate-limiting middleware
@app.middleware("http")
async def rate_limiter(request: Request, call_next):
    client_ip = request.client.host
    today = datetime.now().strftime("%Y-%m-%d")
    # Initialize the counter for first-time IPs
    if client_ip not in rate_limit_store:
        rate_limit_store[client_ip] = {"count": 1, "date": today}
    else:
        record = rate_limit_store[client_ip]
        if record["date"] != today:  # Reset on a new day
            record.update({"count": 1, "date": today})
        elif record["count"] >= 5:  # Throttle
            return JSONResponse(
                status_code=429,
                content={"error": "Daily limit of 5 requests exceeded"}
            )
        else:
            record["count"] += 1
    return await call_next(request)
```
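A quick sanity check of the limiter with FastAPI's TestClient, assuming the middleware above is registered on `app`:

```python
from fastapi.testclient import TestClient

client = TestClient(app)
for i in range(6):
    r = client.get("/voices")
    print(i + 1, r.status_code)  # requests 1-5 pass; the 6th returns 429
```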
System Architecture Overview

The overall system design:

```mermaid
graph TD
    A[User Request] --> B(FastAPI)
    B --> C{Request Type}
    C -->|Text Generation| D[DeepSeek API]
    C -->|Voice Synthesis| E[Kokoro Engine]
    C -->|Image Generation| F[SiliconFlow API]
    D --> G[Data Transformation, Assembly & State Storage]
    E --> G
    F --> G
    G --> H[Response Output]
```

A few follow-up plans for the project:
- Combine sessions to count users accurately per browser
- Add a dedicated scheduled task that clears the rate-limit counters daily (see the sketch after this list)
- Move user data into database storage
- Let users configure their own DeepSeek and FLUX image-generation APIs, so usage time depends on their own quota rather than this host's API limits
- Enhanced multilingual support
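For the scheduled cleanup item, one possible shape, assuming the in-memory `rate_limit_store` shown earlier (a sketch, not implemented yet):

```python
import asyncio

# Roadmap sketch: drop stale per-IP records once a day so the
# in-memory store does not grow without bound.
async def clear_stale_counters():
    while True:
        today = datetime.now().strftime("%Y-%m-%d")
        for ip in [ip for ip, rec in rate_limit_store.items() if rec["date"] != today]:
            del rate_limit_store[ip]
        await asyncio.sleep(24 * 60 * 60)

@app.on_event("startup")
async def start_cleanup():
    asyncio.create_task(clear_stale_counters())
```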
📊 Efficiency Gains over the Previous Workflow

Measured against the project's goals, if every feature runs stably, each stage of a simple video-blog production improves substantially over the traditional workflow:

| Stage | Traditional Time | With OpenVemo | Speedup |
|---|---|---|---|
| Story creation | 1-2 hours | ~35 seconds | ~20x |
| Voiceover | 1-2 hours | ~55 seconds | ~60x |
| Scene images | 3 hours | ~1 minute per image | ~200x |
Finally, the project demo link.

P.S. Both the DeepSeek and SiliconFlow FLUX services are rate-limited and not yet fully stable, so the online address is intended for a first look only.

Live demo: openvemo.demo